Abstract: The focus of this paper is on solving multi-robot planning problems in continuous spaces with partial observability. Decentralized partially observable Markov decision processes (Dec-POMDPs) are general models for multi-robot coordination problems, but representing and solving Dec-POMDPs is often intractable for large problems. To allow for a high-level representation that is natural for multi-robot problems and scalable to large discrete and continuous problems, this paper extends the Dec-POMDP model to the decentralized partially observable semi-Markov decision process (Dec-POSMDP). The Dec-POSMDP formulation allows asynchronous decision-making by the robots, which is crucial in multi-robot domains. We also present an algorithm for solving this Dec-POSMDP which is much more scalable than previous methods since it can incorporate closed-loop belief space macro-actions in planning. These macro-actions are automatically constructed to produce robust solutions. The proposed method's performance is evaluated on a complex multi-robot package delivery problem under uncertainty, showing that our approach can naturally represent multi-robot problems and provide high-quality solutions for large-scale problems.

I. Introduction 

Many real-world multi-robot coordination problems operate in continuous spaces where robots possess partial and noisy sensors. In addition, asynchronous decision-making is often needed due to stochastic action effects and the lack of perfect communication. The combination of these factors makes control very difficult. Ideally, high-quality controllers for each robot would be automatically generated based on a high-level domain specification. In this paper, we present such a method for both formally representing multi-robot coordination problems and automatically generating local planners based on the specification. While these local planners can be a set of hand-coded controllers, we also present an algorithm for automatically generating controllers that can then be sequenced to solve the problem. The result is a principled method for coordination in probabilistic multi-robot domains.

The most general representation of the multi-robot coordination problem is the decentralized partially observable Markov decision process (Dec-POMDP) [8]. Dec-POMDPs have a broad set of applications including networking problems, multi-robot exploration, and surveillance [9], [10], [20], [19]. Unfortunately, current Dec-POMDP solution methods are limited to small discrete domains and require synchronized decision-making. This paper extends some promising recent work on incorporating macro-actions (temporally extended actions) [4], [5] to solve continuous and large-scale problems which were infeasible for previous methods.

Fig. 1. Package delivery domain with key elements labeled.

Macro-actions (MAs) have provided increased scalability 
in single agent MDPs [17] and POMDPs [1], [11], but they 
are nontrivial to extend to multi-agent settings. Some of the 
challenges in extending MAs to decentralized settings are: 

• In the decentralized setting, synchronized decision-making is problematic (or even impossible) as some robots must remain idle while others finish their actions. The resulting solution quality would be poor (or not implementable), resulting in the need for MAs that can be chosen asynchronously by the robots (an issue that has not been considered in the single agent literature).

• Incorporating principled asynchronous MA selection is a challenge, because it is not clear how to choose optimal MAs for one robot while other robots are still executing. Hence, a novel formal framework is needed to represent Dec-POMDPs with asynchronous decision-making and MAs that may last varying amounts of time.

• Designing these variable-time MAs also requires characterizing the stopping time and probability of terminating at every goal state of the MAs. Novel methods are needed that can provide this characterization.

MA-based Dec-POMDPs alleviate the above problems by 
no longer attempting to solve for a policy at the primitive 
action level, but instead considering temporally-extended 
actions, or MAs. This also addresses scalability issues, as 
the size of the action space is considerably reduced. 

In this paper, we extend the Dec-POMDP to the decentralized partially observable semi-Markov decision process (Dec-POSMDP) model, which formalizes the use of closed-loop MAs. The Dec-POSMDP represents the theoretical basis for asynchronous decision-making in Dec-POMDPs. We also automatically design MAs using graph-based planning techniques. The resulting MAs are closed-loop and the completion time and success probability can be characterized analytically, allowing them to be directly integrated into the Dec-POSMDP framework. As a result, our framework can generate efficient decentralized plans which take advantage of estimated completion times to permit asynchronous decision-making. The proposed Dec-POSMDP framework enables solutions for large domains (in terms of state/action/observation space) with long horizons, which are otherwise computationally intractable to solve. We leverage the Dec-POSMDP framework and design an efficient discrete search algorithm for solving it, and demonstrate the performance of the method for the complex problem of multi-robot package delivery under uncertainty (Fig. 1).

II. Problem Statement 

A Dec-POMDP [8] is a sequential decision-making problem where multiple agents (e.g., robots) operate under uncertainty based on different streams of observations. At each step, every agent chooses an action (in parallel) based purely on its local observations, resulting in an immediate reward and an observation for each individual agent based on stochastic (Markovian) models over continuous states, actions, and observation spaces.

We define a notation aimed at reducing ambiguities when discussing single agents and multi-agent teams. A generic parameter $p$ related to the i-th agent is noted as $p^{(i)}$, whereas a joint parameter for a team of $n$ agents is noted as $\bar{p} = (p^{(1)}, p^{(2)}, \cdots, p^{(n)})$. Environment parameters or those referring to graphs are indicated without parentheses, for instance $p^{i}$ refers to a parameter of a graph node, and $p^{ij}$ to a parameter of a graph edge.

Formally, the Dec-POMDP problem we consider in this paper is described by the following elements:

• $I = \{1, 2, \cdots, n\}$ is a finite set of agents' indices.

• $\bar{\mathbb{S}}$ is a continuous set of joint states. The joint state space can be factored as $\bar{\mathbb{S}} = \bar{\mathbb{X}} \times \mathbb{X}^e$, where $\mathbb{X}^e$ denotes the environmental state and $\bar{\mathbb{X}} = \times_i \mathbb{X}^{(i)}$ is the joint state space of robots, with $\mathbb{X}^{(i)}$ being the state space of the i-th agent. $\mathbb{X}^{(i)}$ is a continuous space. We assume $\mathbb{X}^e$ is a finite set.

• $\bar{\mathbb{U}}$ is a continuous set of joint actions, which can be factored as $\bar{\mathbb{U}} = \times_i \mathbb{U}^{(i)}$, where $\mathbb{U}^{(i)}$ is the set of actions for the i-th agent.

• The state transition probability density function is denoted as $p(\bar{s}' \mid \bar{s}, \bar{u})$, which specifies the probability density of transitioning from state $\bar{s} \in \bar{\mathbb{S}}$ to $\bar{s}' \in \bar{\mathbb{S}}$ when the actions $\bar{u} \in \bar{\mathbb{U}}$ are taken by the agents.

• $\bar{R}$ is a reward function, $\bar{R} : \bar{\mathbb{S}} \times \bar{\mathbb{U}} \to \mathbb{R}$, the immediate reward for being in joint state $\bar{x} \in \bar{\mathbb{X}}$ and taking the joint action $\bar{u} \in \bar{\mathbb{U}}$.

• $\bar{\Omega}$ is a continuous set of observations obtained by all agents. It can be factored as $\bar{\Omega} = \bar{\mathbb{Z}} \times \bar{\mathbb{Z}}^e$, where $\bar{\mathbb{Z}} = \times_i \mathbb{Z}^{(i)}$ and $\bar{\mathbb{Z}}^e = \times_i \mathbb{Z}^{e(i)}$. The set $\mathbb{Z}^{(i)} \times \mathbb{Z}^{e(i)}$ is the set of observations obtained by the i-th agent. $\mathbb{Z}^{e(i)}$ is the observation signal that is a function of the environmental state $x^e \in \mathbb{X}^e$. We assume the set of environmental observations $\mathbb{Z}^{e(i)}$ is a finite set for any agent $i$.

• The observation probability density function $h(\bar{o} \mid \bar{s}, \bar{u})$ encodes the probability density of seeing observations $\bar{o} \in \bar{\Omega}$ given that joint action $\bar{u} \in \bar{\mathbb{U}}$ is taken and results in joint state $\bar{s} \in \bar{\mathbb{S}}$.

Fig. 2. Hierarchy of the proposed planner. At the highest level, a decentralized planner assigns a TMA to each robot. Each TMA encompasses a specific task (e.g., picking up a package). Each TMA in turn is constructed as a set of local macro-actions (LMAs). Each LMA (the lower layer) is a feedback controller that acts as a funnel. LMAs funnel a large set of beliefs to a small set of beliefs (termination belief of the LMA).


Note that a general Dec-POMDP need not have a factored 
state space such as the one given here. 

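The solution of a Dec-POMDP is a collection of decentralized policies $\bar{\eta} = (\eta^{(1)}, \eta^{(2)}, \cdots, \eta^{(n)})$. Because (in general) each agent does not have access to the observations of other agents, each policy $\eta^{(i)}$ maps the individual data history (obtained observations and taken actions) of the i-th agent into its next action: $u_t^{(i)} = \eta^{(i)}(H_t^{(i)})$, where $H_t^{(i)} = \{o_1^{(i)}, u_1^{(i)}, o_2^{(i)}, u_2^{(i)}, \cdots, o_{t-1}^{(i)}, u_{t-1}^{(i)}, o_t^{(i)}\}$.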

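According to the above definition, we can define the value associated with a given policy $\bar{\eta}$ starting from an initial joint state distribution $b$:

$$V^{\bar{\eta}}(b) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} \bar{R}(\bar{s}_t, \bar{u}_t) \,\middle|\, \bar{\eta},\; p(\bar{s}_0) = b\right] \tag{1}$$

Then, a solution to a Dec-POMDP can formally be defined as the optimal policy:

$$\bar{\eta}^{*} = \arg\max_{\bar{\eta}} V^{\bar{\eta}} \tag{2}$$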

The Dec-POMDP problem stated in (2) is undecidable over continuous spaces without additional assumptions. Recent work has extended the Dec-POMDP model to incorporate macro-actions which can be executed in an asynchronous manner [5]. In planning with MAs, decision making occurs in a two-layer manner (see Fig. 2). A higher-level policy returns an MA for each agent, and the selected MA returns a primitive action to be executed. This approach is an extension of the options framework [17] to multi-agent domains while dealing with the lack of synchronization between agents. The options framework is a formal model of MAs [17] that has been very successful in aiding representation and solutions in single robot domains [12]. Unfortunately, this method requires a full, discrete model of the system (including macro-action policies of all agents, sensors and dynamics).

As an alternative, we propose a Dec-POSMDP model which only requires a high-level model of the problem. The Dec-POSMDP provides a high-level discrete planning formalism which can be defined on top of continuous spaces. As such, we can approximate the continuous multi-robot coordination problems with a tractable Dec-POSMDP formulation. Before defining our more general Dec-POSMDP model, we first discuss the form of MAs that allow for efficient planning within our framework.

III. Hierarchical Graph-based Macro-actions 

This section introduces a mechanism to generate complex 
MAs based on a graph of lower-level simpler MAs, which 
is a key point in solving Dec-POSMDPs without explicitly 
computing success probabilities, times, and rewards of MAs 
in the decentralized planning level. We refer to the generated 
complex MAs as Task MAs (TMAs). 

To clarify the concepts, before describing task macro-actions, we distinguish between open-loop and closed-loop MAs. In general, MAs refer to temporally-extended actions [18]. An $\ell$-step long open-loop MA is a sequence of predefined actions such as $u_{0:\ell} = \{u_0, u_1, \cdots, u_\ell\}$. A closed-loop MA, however, is a policy $\pi(\cdot)$, i.e., a mapping from histories to actions. As we discuss in the next section, for a seamless incorporation of MAs in Dec-POMDP planning, we need to design closed-loop MAs, which is a challenge in partially-observable settings.

Generating a closed-loop MA that accomplishes a single-agent task such as picking up an object, delivering a package, opening a door, and so on, itself requires solving a POMDP problem. In this paper, we utilize information roadmaps [1] as a substrate to generate such task MAs. We start by discussing the structure of feedback controllers in partially-observable domains.

A macro-action $\pi^{(i)}$ for the i-th agent maps the histories $H^{(i)}$ of actions and observations that have occurred to actions. Note that the environmental observations are only obtained when the MA terminates. We compress this history into a belief $b^{(i)}$, with joint belief for the team denoted by $\bar{b} = (b^{(1)}, b^{(2)}, \cdots, b^{(n)})$. It is well known [13] that making decisions based on belief $b^{(i)}$ is equivalent to making decisions based on the history $H^{(i)}$ in a POMDP.

For any given agent, a feedback controller in a partially-observable environment comprises a Bayesian filter $b_{k+1} = \tau(b_k, u_k, z_{k+1})$ that evolves the belief and a separated controller $u_{k+1} = \mu(b_{k+1})$ that generates control signals based on the current belief. Therefore, a feedback controller $\mathcal{L}$ in belief space can be viewed as a function that maps the current belief $b_k$, control $u_k$, and observation $z_{k+1}$ to the pair of next belief $b_{k+1}$ and control $u_{k+1}$; i.e., $(b_{k+1}, u_{k+1}) = \mathcal{L}(b_k, u_k, z_{k+1}) = (\tau(b_k, u_k, z_{k+1}), \mu(\tau(b_k, u_k, z_{k+1})))$.
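As an illustrative sketch (not the paper's implementation), the composition of a belief filter and a separated controller into one belief-space feedback controller can be written as follows. The scalar motion and sensor model, the noise variances `Q` and `R_NOISE`, the controller gain and target, and all function names are assumptions made only for this example.

```python
from typing import Tuple

Belief = Tuple[float, float]  # (mean, variance) of a scalar Gaussian belief

# Assumed noise variances for a toy model x' = x + u + w, z = x' + v.
Q = 0.01        # process noise variance (hypothetical)
R_NOISE = 0.04  # measurement noise variance (hypothetical)


def kalman_tau(b: Belief, u: float, z: float) -> Belief:
    """Bayesian filter tau: one Kalman predict/update step on the belief."""
    mean, var = b
    mean_pred, var_pred = mean + u, var + Q
    gain = var_pred / (var_pred + R_NOISE)
    return mean_pred + gain * (z - mean_pred), (1.0 - gain) * var_pred


def linear_mu(b: Belief, target: float = 1.0, gain: float = 0.5) -> float:
    """Separated controller mu: acts on the belief mean only."""
    return -gain * (b[0] - target)


def make_feedback_controller(tau, mu):
    """Build the map (b_k, u_k, z_{k+1}) -> (b_{k+1}, u_{k+1})."""
    def controller(b_k, u_k, z_next):
        b_next = tau(b_k, u_k, z_next)
        return b_next, mu(b_next)
    return controller


controller = make_feedback_controller(kalman_tau, linear_mu)
b_next, u_next = controller((0.0, 1.0), 0.0, 0.2)
```

The controller never sees the true state: the filter output is its only input, which is what makes the design "separated".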

A local MA (LMA) we consider herein is a feedback controller that is effective (has a basin of attraction) locally in a region of the state/belief space. Many controllers that rely on linearization fall into this category, as the linearization assumption is valid only locally around the linearization point. The goal of an LMA in the partially-observable setting is to drive the system's belief to a particular belief. In [1] it has been shown that in Gaussian belief space, utilizing a combination of Kalman filter and linear controllers, the system's belief can be steered toward certain probability distributions. In other words, LMAs act like a funnel in belief space. Starting from a belief in the mouth of the funnel (see Fig. 2), the LMA drives the belief toward a target belief that is referred to as a milestone herein.

Since LMAs act locally on belief space, we can locally linearize the system and design corresponding simple LMAs. In this paper, we assume the belief space is Gaussian and thus belief can be represented with a mean vector $\hat{x}^{+}$ and covariance matrix $P^{+}$, denoted as $b \equiv (\hat{x}^{+}, P^{+})$. For a given state point $v$, we linearize the nonlinear process and measurement equations to get a stationary linear system with Gaussian noises. Associated with this linear system, we design a stationary Kalman filter and a linear separated controller, $\mu(b) = -L(\hat{x}^{+} - v)$. Thus, the utilized LMA is parametrized by feedback gain matrix $L$ and point $v$; i.e., $\mu(b; \theta)$, where $\theta = (L, v)$. It can be shown that under an appropriate choice of $L$ and mild observability conditions, this linear LMA acts as a funnel in belief space that drives the belief toward the milestone $\check{b} \equiv (\check{x}, \check{P})$, where $\check{x} = v$ and $\check{P}$ is the solution of the Riccati equation corresponding to the Kalman filter [1].
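The funnel behaviour can be illustrated with a scalar toy system. The sketch below is an assumption-laden simplification rather than the construction in [1]: it uses a one-dimensional linear model with hypothetical parameters `a`, `c`, `q`, `r`, propagates only the expected mean under $u = -L(\hat{x}^{+} - v)$ (measurement noise on the mean is ignored), and iterates the Kalman covariance recursion until it settles at the Riccati fixed point.

```python
def riccati_step(p, a=1.0, c=1.0, q=0.01, r=0.04):
    """One predict/update step of the scalar Kalman covariance recursion."""
    p_pred = a * a * p + q
    gain = p_pred * c / (c * c * p_pred + r)
    return (1.0 - gain * c) * p_pred


def run_funnel(mean0, var0, v=1.0, gain_l=0.5, steps=200):
    """Propagate the expected belief mean under u = -L (mean - v) and the
    variance under the Riccati recursion."""
    mean, var = mean0, var0
    for _ in range(steps):
        mean = mean + (-gain_l * (mean - v))  # closed-loop mean dynamics
        var = riccati_step(var)               # covariance is control-independent
    return mean, var


# Two very different initial beliefs end at the same milestone (v, P_check).
wide = run_funnel(mean0=-3.0, var0=10.0)
narrow = run_funnel(mean0=2.0, var0=1e-6)
```

Both runs end at mean $v = 1$ and at the same stationary variance, which is the one-dimensional analogue of a large set of beliefs being funnelled to a single milestone.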

A chain of funnels is a sequence of funnels where the target belief of each funnel falls into the mouth (or pre-image) of the next funnel in the chain. A richer way of combining funnels is via graphication (to form a graph of funnels). An information roadmap is defined as a graph of funnels, where each node of this graph is a milestone and each edge is an LMA funnel.

To construct a graph of LMAs, we sample a set of parameters $\{\theta^{j}\}$ and generate corresponding LMAs $\{\mathcal{L}^{j}\}$. Associated with the j-th LMA, we compute the j-th milestone $\check{b}^{j}$. We define the j-th node of our LMA graph as an $\epsilon$-neighborhood around the milestone; i.e., $B^{j} = \{b : \|b - \check{b}^{j}\| < \epsilon\}$, and the set of all nodes as $\mathbb{V} = \{B^{j}\}$. We connect $B^{j}$ to its k-nearest neighbors via their corresponding LMAs. If neighboring nodes $i$ and $j$ are so far from each other that the j-th LMA cannot take the belief from $i$ to $j$ (since the linearization used to construct $\mathcal{L}^{j}$ is not valid around $B^{i}$), we utilize an edge controller (as detailed in [1]). An edge controller is a finite-time controller whose role is to take the mean of the distribution close enough to the node $B^{j}$ via tracking a trajectory that connects $v^{i}$ to $v^{j}$ in the state space. Once the distribution mean gets close enough to the target node, the system's control is handed over to the funnel associated with the target node. We denote the concatenation of the edge controller and the funnel utilized to take the belief from $B^{i}$ to $B^{j}$ by $\mathcal{L}^{ij}$, which is defined as the $(i, j)$-th graph edge. The set of all edges is denoted by $\mathbb{L} = \{\mathcal{L}^{ij}\}$. We denote the set of available LMAs at $B^{i}$ by $\mathbb{L}(i)$. To incorporate the lower-level state constraints (e.g., obstacles) and control constraints, we consider $B^{0}$ as a hypothetical node, hitting which represents violation of constraints. We add $B^{0}$ to the set of nodes $\mathbb{V}$. Therefore, taking any $\mathcal{L}^{ij}$ there is a chance that the system ends up in $B^{0}$.

We can simulate the behavior of LMA $\mathcal{L}^{ij}$ at $B^{i}$ offline and compute the probability of landing in any given node $B^{r}$, which is denoted by $P(B^{r} \mid B^{i}, \mathcal{L}^{ij})$. Similarly, we can compute the reward of taking LMA $\mathcal{L}^{ij}$ at $B^{i}$ offline, which is denoted by $R(B^{i}, \mathcal{L}^{ij})$ and defined as the sum of one-step rewards under this LMA. Finally, by $T(B^{i}, \mathcal{L}^{ij})$ we denote the time it takes for LMA $\mathcal{L}^{ij}$ to complete its execution starting from $B^{i}$.
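One simple way to obtain such edge statistics is Monte Carlo simulation. The following is a minimal sketch under stated assumptions: `simulate_lma` is a hypothetical stand-in for a rollout of one LMA from its start node that returns the landing node, the accumulated one-step reward, and the duration; `toy_lma` and the node labels `"B_j"` and `"B_0"` are invented for illustration.

```python
import random
from collections import Counter


def estimate_edge_statistics(simulate_lma, runs=1000, seed=0):
    """Estimate landing probabilities, mean reward, and mean completion time
    of one LMA by repeated offline simulation."""
    rng = random.Random(seed)
    landings = Counter()
    total_reward = 0.0
    total_time = 0.0
    for _ in range(runs):
        node, reward, duration = simulate_lma(rng)
        landings[node] += 1
        total_reward += reward
        total_time += duration
    probs = {node: count / runs for node, count in landings.items()}
    return probs, total_reward / runs, total_time / runs


def toy_lma(rng):
    """Hypothetical rollout: hits the failure node B_0 with probability 0.1,
    otherwise lands in B_j; every step costs one unit of reward."""
    failed = rng.random() < 0.1
    steps = rng.randint(8, 12)
    return ("B_0" if failed else "B_j", -1.0 * steps, float(steps))


probs, mean_reward, mean_time = estimate_edge_statistics(toy_lma)
```

The three returned quantities play the roles of $P(\cdot \mid B^{i}, \mathcal{L}^{ij})$, $R(B^{i}, \mathcal{L}^{ij})$, and $T(B^{i}, \mathcal{L}^{ij})$ for a single edge.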

A. Utilizing TMAs in the Decentralized Setting 

In a decentralized setting, the following properties of the macro-action need to be available to the high-level decentralized planner: (i) TMA value from any given initial belief, (ii) TMA completion time from any given belief, and (iii) TMA success probability from any given belief. What makes computing these properties challenging is the requirement that they need to be calculated for every possible initial belief. Every belief is needed because when one agent's TMA terminates, the other agents might be in any belief while still continuing to execute their own TMA. This information about the progress of agents' TMAs is needed for nontrivial asynchronous TMA selection.

In the following, we discuss how the graph-based structure 
of our proposed TMAs allows us to compute a closed-form 
equation for the success probability, value, and time. As a 
result, when evaluating the high-level decentralized policy, 
these values can be efficiently retrieved for any given start 
and goal states. This is particularly important in decentralized 
multi-agent planning since the state/belief of the j-th agent 
is not known a priori when the MA of i-th agent terminates. 

A TMA policy is defined as a policy that is found by performing dynamic programming on the graph of LMAs. Consider a graph of LMAs that is constructed to perform a simple task such as open-the-door, pick-up-a-package, move-a-package, etc. An important feature of this graph is that it is multi-query, meaning that it is valid for any starting and goal belief. Depending on the goal belief of the task, we can solve the dynamic programming problem on the LMA graph that leads to a policy which achieves the goal while trying to maximize the accumulated reward and taking into account the probability of hitting the failure set $B^{0}$. Formally, we need to solve the following DP:

$$V^{*}(B^{i}; \pi^{*}) = \max_{\mathcal{L} \in \mathbb{L}(i)} \Big\{ R(B^{i}, \mathcal{L}) + \sum_{j} P(B^{j} \mid B^{i}, \mathcal{L})\, V^{*}(B^{j}) \Big\}, \quad \forall i$$

$$\pi^{*}(B^{i}) = \arg\max_{\mathcal{L} \in \mathbb{L}(i)} \Big\{ R(B^{i}, \mathcal{L}) + \sum_{j} P(B^{j} \mid B^{i}, \mathcal{L})\, V^{*}(B^{j}) \Big\}, \quad \forall i \tag{3}$$

where $V^{*}(\cdot)$ is the optimal value defined over the graph nodes, with $V(B^{goal})$ set to zero and $V(B^{0})$ set to a suitable negative reward for violating constraints. $\pi^{*}(\cdot)$ is the resulting TMA. The primitive actions can be retrieved from the TMA via a two-stage computation: the TMA picks the best LMA at each milestone and the LMA generates the next action based on the perceived observations until the belief reaches the next milestone, i.e., $u_{k+1} = \mathcal{L}(b_k, u_k, z_{k+1}) = \pi^{*}(B)(b_k, u_k, z_{k+1})$, where $B$ is the last visited milestone and $\mathcal{L} = \pi^{*}(B)$ is the best LMA chosen by the TMA at milestone $B$. The space of TMAs is denoted as $\mathbb{T} = \{\pi\}$.
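The dynamic program over the LMA graph can be sketched with plain value iteration. This is an illustrative example only: the dictionary layout, the node and LMA names, the failure penalty of -100, and the use of iterative sweeps (rather than any particular solver used by the authors) are all assumptions.

```python
def lma_value(reward, transitions, value):
    """Backup term R(B^i, L) + sum_j P(B^j | B^i, L) V(B^j)."""
    return reward + sum(p * value[j] for j, p in transitions.items())


def solve_tma(edges, goal, fail, fail_value=-100.0, tol=1e-9, max_iters=10_000):
    """Value iteration on an LMA graph.

    edges[i][name] = (reward, {next_node: probability}) for each LMA at node i.
    Returns the value of every node and the best LMA at each non-terminal node.
    """
    value = {node: 0.0 for node in edges}
    value[goal] = 0.0          # V(B^goal) = 0
    value[fail] = fail_value   # penalty for violating constraints
    for _ in range(max_iters):
        delta = 0.0
        for i, lmas in edges.items():
            best = max(lma_value(r, t, value) for r, t in lmas.values())
            delta = max(delta, abs(best - value[i]))
            value[i] = best
        if delta < tol:
            break
    policy = {
        i: max(lmas, key=lambda name: lma_value(*lmas[name], value))
        for i, lmas in edges.items()
    }
    return value, policy


# Hypothetical three-node task graph plus goal and failure nodes.
edges = {
    "B1": {
        "detour": (-2.0, {"B2": 1.0}),
        "shortcut": (-1.0, {"goal": 0.9, "B0": 0.1}),
    },
    "B2": {"approach": (-1.0, {"goal": 1.0})},
}
value, policy = solve_tma(edges, goal="goal", fail="B0")
```

In this toy graph the shortcut is cheaper per step but risks the failure node, so the resulting TMA prefers the detour at `B1`.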

For a given optimal TMA $\pi^{*}$, the associated optimal value $V^{*}(B^{i}; \pi^{*})$ from any node $B^{i}$ is computed via solving (3). Also, using Markov chain theory we can analytically compute the probability $P(B^{goal} \mid B^{i}; \pi^{*})$ of reaching the goal node $B^{goal}$ under the optimal TMA $\pi^{*}$ starting from any node $B^{i}$ in the offline phase [1].

Similarly, we can compute the time it takes for the TMA to go from $B^{i}$ to $B^{goal}$ under $\pi^{*}$ as follows:

$$T^{g}(B^{i}; \pi^{*}) = T(B^{i}, \pi^{*}(B^{i})) + \sum_{j} P(B^{j} \mid B^{i}, \pi^{*}(B^{i}))\, T^{g}(B^{j}; \pi^{*}), \quad \forall i \tag{4}$$

where $T^{g}(B; \pi)$ denotes the time it takes for TMA $\pi$ to take the system from $B$ to the TMA's goal. Defining $T^{i} = T(B^{i}; \pi^{*}(B^{i}))$ and $\bar{T} = (T^{1}, T^{2}, \cdots, T^{n})^{T}$, we can write (4) in its matrix form as:

$$\bar{T}^{g} = \bar{T} + \bar{P}\,\bar{T}^{g} \;\Rightarrow\; \bar{T}^{g} = (I - \bar{P})^{-1}\,\bar{T} \tag{5}$$

where $\bar{T}^{g}$ is a column vector with i-th element equal to $T^{g}(B^{i}; \pi^{*})$ and $\bar{P}$ is a matrix with $(i, j)$-th entry equal to $P(B^{j} \mid B^{i}, \pi^{*}(B^{i}))$.

Therefore, a TMA can be used in a higher-level planning 
algorithm as a MA whose success probability, execution 
time, and reward can be computed offline. 
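The linear system behind the completion-time computation can be solved directly once the TMA is fixed. The sketch below assumes that the matrix `P` is restricted to the non-terminal nodes (so that $I - \bar{P}$ is invertible) and uses a small hand-written Gauss-Jordan solver to stay dependency-free; the two-node example and its numbers are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def completion_times(P, T):
    """Solve (I - P) T_g = T for the expected time-to-goal vector T_g."""
    n = len(T)
    i_minus_p = [
        [(1.0 if i == j else 0.0) - P[i][j] for j in range(n)]
        for i in range(n)
    ]
    return solve_linear(i_minus_p, T)


# Node 0 repeats itself w.p. 0.2 and moves to node 1 w.p. 0.8;
# node 1 reaches the goal with certainty (its row over transient nodes is zero).
P = [[0.2, 0.8], [0.0, 0.0]]
T = [5.0, 3.0]  # single-LMA durations under the chosen TMA
times = completion_times(P, T)
```

For this example the expected time-to-goal is 9.25 from node 0 and 3.0 from node 1, which can be checked by hand from the recursion.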

B. Environmental State and Updated Model 

We also extend TMAs to the multi-agent setting where 
there is an environmental state that is locally observable by 
agents and can be affected by other agents. 

We denote the environment state (e-state) at the k-th time step as $x_k^{e} \in \mathbb{X}^{e}$. It encodes the information in the environment that can be manipulated and observed by different agents. We assume $x_k^{e}$ is only locally (partially) observable. An example for $x_k^{e}$ in the package delivery application (presented in Section V) is "there is a package in the base". An agent can only get this measurement if the agent is in the base (hence it is partial).

Any given TMA $\pi$ is only available at a subset of e-states, denoted by $\mathbb{X}^{e}(\pi)$. In many applications $\mathbb{X}^{e}(\pi)$ is a small finite set. Thus, we can extend the cost and transition probabilities of TMA $\pi$ for all $x^{e} \in \mathbb{X}^{e}$ by performing the TMA evaluation described in Section III-A for all $x^{e} \in \mathbb{X}^{e}$.

We extend transition probabilities $P(B^{goal} \mid b; \pi)$ to take the e-state into account, i.e., $P(B^{goal}, x^{e\prime} \mid b, x^{e}; \pi)$, which denotes the probability of getting to the goal region $B^{goal}$ and e-state $x^{e\prime}$ starting from belief $b$ and e-state $x^{e}$ under the TMA policy $\pi$. Similarly, the TMA's value function $V(b; \pi)$ is extended to $V(b, x^{e}; \pi)$, for all $x^{e} \in \mathbb{X}^{e}(\pi)$.

The joint reward $\bar{R}(\bar{x}, x^{e}, \bar{u})$ encodes the reward obtained by the entire team, where $\bar{x} = (x^{(1)}, \cdots, x^{(n)})$ is the set of states for different agents and $\bar{u} = (u^{(1)}, \cdots, u^{(n)})$ is the set of actions taken by all agents.

We assume the joint reward is a multi-linear function of a set of reward functions $\{R^{(1)}, \cdots, R^{(n)}\}$ and $R^{E}$, where $R^{(i)}$ only depends on the i-th agent's state and $R^{E}$ depends on all the agents. In other words, we have:

$$\bar{R}(\bar{x}, x^{e}, \bar{u}) = g\big(R^{(1)}(x^{(1)}, x^{e}, u^{(1)}),\; R^{(2)}(x^{(2)}, x^{e}, u^{(2)}),\; \cdots,\; R^{(n)}(x^{(n)}, x^{e}, u^{(n)}),\; R^{E}(\bar{x}, x^{e}, \bar{u})\big) \tag{6}$$

In multi-agent planning domains, often computing $R^{E}$ is computationally less expensive than computing $\bar{R}$, which is the property we exploit in designing the higher-level decentralized algorithm.

The joint reward $\bar{R}(\bar{b}, x^{e}, \bar{u})$ encodes the reward obtained by the entire team, where $\bar{b} = (b^{(1)}, \cdots, b^{(n)})$ is the joint belief and $\bar{u}$ is the joint action defined previously.

Similarly, the joint policy $\bar{\phi} = (\phi^{(1)}, \cdots, \phi^{(n)})$ is the set of all decentralized policies, where $\phi^{(i)}$ is the decentralized policy associated with the i-th agent. In the next section, we discuss how these decentralized policies can be computed based on the Dec-POSMDP formulation.

Joint value $\bar{V}(\bar{b}, x_0^{e}, \bar{\phi})$ encodes the value of executing the collection $\bar{\phi}$ of decentralized policies starting from environment state $x_0^{e}$ and initial joint belief $\bar{b}$.

IV. The Dec-POSMDP Framework 


In this section, we formally introduce the Dec-POSMDP framework. We discuss how the use of TMAs can transform a continuous Dec-POMDP to a Dec-POSMDP over a finite number of MAs. This transformation allows discrete domain algorithms to generate a decentralized solution for general continuous problems.

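We denote the high-level decentralized policy for the i-th agent by $\phi^{(i)} : \Xi^{(i)} \to \mathbb{T}^{(i)}$, where $\Xi^{(i)}$ is the macro-action history for the i-th agent (as opposed to the action-observation history), which is formally defined as:

$$\xi_k^{(i)} = \big(o_0^{e(i)}, \pi_0^{(i)}, H_0^{(i)}, o_1^{e(i)}, \pi_1^{(i)}, H_1^{(i)}, \cdots, o_{k-1}^{e(i)}, \pi_{k-1}^{(i)}, H_{k-1}^{(i)}, o_k^{e(i)}\big)$$

which includes the chosen macro-actions $\{\pi_k^{(i)}\}$, the action-observation histories $\{H_k^{(i)}\}$ under chosen macro-actions, and the environmental observations $\{o_k^{e(i)}\}$ received at the termination of macro-actions. Accordingly, we can define a joint policy $\bar{\phi} = (\phi^{(1)}, \phi^{(2)}, \cdots, \phi^{(n)})$ for all agents and a joint macro-action policy as $\bar{\pi} = (\pi^{(1)}, \pi^{(2)}, \cdots, \pi^{(n)})$.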

Each time an agent completes a TMA, it receives an observation of the environmental state $x^{e}$, denoted by $o^{e}$. Also, due to the special structure of the proposed TMAs, we can record the agent's final belief $b^{f}$ at the TMA's termination (which compresses the entire history $H$ under that TMA). We denote this pair as $\breve{o} = (b^{f}, o^{e})$. As a result, we can compress the macro-action history as $\xi_k^{(i)} = \{\breve{o}_0^{(i)}, \pi_0^{(i)}, \breve{o}_1^{(i)}, \pi_1^{(i)}, \cdots, \breve{o}_{k-1}^{(i)}, \pi_{k-1}^{(i)}, \breve{o}_k^{(i)}\}$.

The value of joint policy $\bar{\phi}$ is

$$\bar{V}^{\bar{\phi}}(\bar{b}) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} \bar{R}(\bar{s}_t, \bar{u}_t) \,\middle|\, p(\bar{s}_0) = \bar{b},\; \bar{\phi},\; \{\bar{\pi}\}\right] \tag{7}$$

but it is unclear how to evaluate this equation without a full (discrete) low-level model of the domain. Even in that case, it would often be intractable to calculate the value directly. Therefore, we will formally define the Dec-POSMDP problem, which has the same goal as the Dec-POMDP problem (finding the optimal policy), but in this case we seek the optimal policy for choosing macro-actions in our semi-Markov setting:

$$\bar{\phi}^{*} = \arg\max_{\bar{\phi}} \bar{V}^{\bar{\phi}} \tag{8}$$


Definition 1: (Dec-POSMDP) The Dec-POSMDP framework is described by the following elements:

• $I = \{1, 2, \cdots, n\}$ is a finite set of agents' indices.

• $\mathbb{B}^{(1)} \times \mathbb{B}^{(2)} \times \cdots \times \mathbb{B}^{(n)} \times \mathbb{X}^{e}$ is the underlying state space for the proposed Dec-POSMDP, where $\mathbb{B}^{(i)}$ is the set of beliefs of the i-th agent's TMAs (i.e., $\mathbb{T}^{(i)}$).

• $\bar{\mathbb{T}} = \mathbb{T}^{(1)} \times \mathbb{T}^{(2)} \times \cdots \times \mathbb{T}^{(n)}$ is the space of high-level actions in the Dec-POSMDP, where $\mathbb{T}^{(i)}$ is the set of TMAs for the i-th agent.

• $P(\bar{b}', x^{e\prime}, k \mid \bar{b}, x^{e}; \bar{\pi})$ denotes the transition probability under TMAs $\bar{\pi}$ from a given $\bar{b}, x^{e}$ to $\bar{b}', x^{e\prime}$, as described below.

• $\bar{R}^{\tau}(\bar{b}, x^{e}; \bar{\pi})$ denotes the reward/value of taking TMA $\bar{\pi}$ at $\bar{b}, x^{e}$, as described below.

• $\mathbb{O}$ is the set of environmental observations.

• $P(\bar{\breve{o}} \mid \bar{b}, x^{e})$ denotes the observation likelihood model.

Again, a general Dec-POSMDP need not have a factored state space. Also, while the state space is very large, macro-actions allow much of it to be ignored once these high-level transition, reward and observation functions are calculated.

Below, we describe the above elements in more detail. For further explanation and derivations please see [16]. The planner $\bar{\phi} = (\phi^{(1)}, \phi^{(2)}, \cdots, \phi^{(n)})$ that we construct in this section is fully decentralized in the sense that each agent $i$ has its own policy $\phi^{(i)} : \Xi^{(i)} \to \mathbb{T}^{(i)}$ that generates the next MA based on the history of MAs taken and the observations perceived solely by the i-th agent.

Each policy $\phi^{(i)}$ is a discrete controller, represented by a (policy) graph [3], [14]. Each node of the discrete controller corresponds to an MA. Note that different nodes in the controller could use the same MA. Each edge in this graph is an observation $\breve{o}$. An example discrete controller for a package delivery domain is illustrated in Fig.
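A policy graph of this kind is straightforward to represent in code. The sketch below is only an illustration of the data structure: the class name `PolicyController`, the TMA names, and the observation labels are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class PolicyController:
    """Discrete controller: nodes carry TMAs, edges are labelled by observations."""
    node_tma: Dict[int, str]             # node id -> TMA executed at that node
    edges: Dict[Tuple[int, str], int]    # (node id, observation) -> next node id
    current: int = 0

    def tma(self) -> str:
        """TMA the agent should execute at the current node."""
        return self.node_tma[self.current]

    def step(self, observation: str) -> str:
        """Follow the edge matching the observation received at TMA termination."""
        self.current = self.edges[(self.current, observation)]
        return self.tma()


# Hypothetical single-robot controller; two nodes share the same TMA.
controller = PolicyController(
    node_tma={0: "go_to_base", 1: "pick_up", 2: "deliver", 3: "go_to_base"},
    edges={
        (0, "package_present"): 1,
        (0, "no_package"): 0,
        (1, "holding_package"): 2,
        (2, "delivered"): 3,
        (3, "package_present"): 1,
        (3, "no_package"): 3,
    },
)
first = controller.tma()
second = controller.step("package_present")
third = controller.step("holding_package")
```

Because the controller only consumes the agent's own observations, each robot can advance through its graph at its own pace, which is what permits asynchronous execution.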

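Consider a set of MAs $\bar{\pi} = (\pi^{(1)}, \pi^{(2)}, \cdots, \pi^{(n)})$ for the entire team. Incorporating terminal conditions for different agents, we can evaluate the set of decentralized MAs until at least one of them stops as:

$$\bar{R}^{\tau}(\bar{b}, x^{e}; \bar{\pi}) = \mathbb{E}\left[\sum_{t=0}^{\bar{\tau}_{\min}} \gamma^{t} \bar{R}(\bar{x}_t, x_t^{e}, \bar{u}_t) \,\middle|\, \bar{\pi},\; p(\bar{x}_0) = \bar{b},\; x_0^{e} = x^{e}\right]$$

$$\text{where } \bar{\tau}_{\min} = \min_{i} \min_{t} \{t : b_t^{(i)} \in B^{(i),goal}\} \tag{9}$$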

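It can be shown that the probability of transitioning between two configurations (from $\bar{b}, x^{e}$ to $\bar{b}', x^{e\prime}$) after $k$ steps under the set of MAs $\bar{\pi}$ is given by:

$$P(\bar{b}', x^{e\prime}, k \mid \bar{b}, x^{e}; \bar{\pi}) = P(x_k^{e}, \bar{b}_k \mid x_0^{e}, \bar{b}_0; \bar{\pi}) = \sum_{x_{k-1}^{e},\, \bar{b}_{k-1}} \Big[ P(x_k^{e} \mid x_{k-1}^{e}; \bar{\pi}(\bar{b}_{k-1})) \times P(\bar{b}_k \mid x_{k-1}^{e}, \bar{b}_{k-1}; \bar{\pi}(\bar{b}_{k-1}))\, P(x_{k-1}^{e}, \bar{b}_{k-1} \mid x_0^{e}, \bar{b}_0; \bar{\pi}) \Big] \tag{10}$$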


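Joint value $\bar{V}^{\bar{\phi}}(\bar{b}, x^{e})$ then encodes the value of executing the collection $\bar{\phi}$ of decentralized policies starting from environment state $x_0^{e}$ and initial joint belief $\bar{b}_0$. The equation below describes the value transformation from the primitive actions to MAs, which is vital for allowing us to efficiently perform evaluation. Details of this derivation can be found in [16].

$$\bar{V}^{\bar{\phi}}(\bar{b}_0, x_0^{e}) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} \bar{R}(\bar{x}_t, x_t^{e}, \bar{u}_t) \,\middle|\, \bar{\phi}, \bar{b}_0, x_0^{e}\right] = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{t_k} \bar{R}^{\tau}(\bar{b}_{t_k}, x_{t_k}^{e}; \bar{\pi}_{t_k}) \,\middle|\, \bar{\phi}, \bar{b}_0, x_0^{e}\right] \tag{11}$$

where $t_k = \min_{i} \min_{t} \{t > t_{k-1} : b_t^{(i)} \in B^{(i),goal}\}$ and

$$\bar{\pi} = \bar{\phi}(\bar{b}, x^{e}). \tag{12}$$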


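The dynamic programming formulation corresponding to the defined joint value function over MAs is:

$$\bar{V}^{\bar{\phi}}(\bar{b}, x^{e}) = \bar{R}^{\tau}(\bar{b}, x^{e}; \bar{\pi}) + \sum_{k=0}^{\infty} \gamma^{k} \sum_{\bar{b}',\, x^{e\prime}} P(\bar{b}', x^{e\prime}, k \mid \bar{b}, x^{e}; \bar{\pi})\, \bar{V}^{\bar{\phi}}(\bar{b}', x^{e\prime}) \tag{13}$$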


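Algorithm 1: MMCS

 1  Procedure: MMCS($\mathbb{T}$, $K_d$)
 2  input: set of TMAs, $\mathbb{T}$; default number of best policies to check in each iteration, $K_d$
 3  output: decentralized policy $\bar{\phi}$
 4  foreach agent $i$ do
 5      $masked^{(i)}$ ← setToFalse();
 6      $\phi_d^{(i)}$ ← null;
 7  for $iter_{MMCS}$ = 1 to $iter_{max,MMCS}$ do
 8      for $iter_{MC}$ = 1 to $iter_{max,MC}$ do
 9          $\bar{\phi}_{new}$ ← $\bar{\phi}_d$;
10          foreach agent $i$ do
11              foreach $(\pi, \breve{o}) \in \mathbb{T} \times \mathbb{O}$ do
12                  if not $masked^{(i)}(\pi, \breve{o})$ then
13                      $\phi_{new}^{(i)}(\pi, \breve{o})$ ← sample($\mathbb{T}(\pi, \breve{o})$);
14          $\bar{\phi}_{list}$.append($\bar{\phi}_{new}$);
15          $\bar{V}^{\bar{\phi}}(\bar{b}, x^{e})$.append(evalPolicy($\bar{\phi}_{new}$));
16      ...
17      $\bar{\phi}_{list}$ ← getBestKPolicies($\bar{\phi}_{list}$, $\bar{V}^{\bar{\phi}}(\bar{b}, x^{e})$, $K$);
18      ($masked$, $\bar{\phi}_d$) ← createMask($\bar{\phi}_{list}$);
19  $K$ ← 1;
20  $\bar{\phi}$ ← getBestKPolicies($\bar{\phi}_{list}$, $\bar{V}^{\bar{\phi}}(\bar{b}, x^{e})$, $K$);
21  return $\bar{\phi}$;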


The critical reduction from the continuous Dec-POMDP to the Dec-POSMDP over a finite number of macro-actions is a key factor in solving large Dec-POMDP problems. In the following section we discuss how we compute a decentralized policy based on the Dec-POSMDP formulation.

A. Masked Monte Carlo Search (MMCS)

In this section, we propose an efficient method, referred to as Masked Monte Carlo Search (MMCS), to generate a decentralized multi-agent solution to solve the Dec-POSMDP. As demonstrated in Section V, MMCS allows extremely large problems to be solved. It uses an informed Monte Carlo policy sampling scheme to achieve this, exploiting results from previous policy evaluations to narrow the search space.

Because the infinite-horizon problem is undecidable [15], infinite-horizon methods typically focus on producing approximate solutions given a computational budget [14], [3]. A Monte Carlo approach can be implemented by repeatedly randomly sampling from the policy space and retaining the policy with the highest expected value as an approximate solution. The search can be stopped at any point and the latest iteration of the approximate solution can be utilized.

The MMCS algorithm is detailed in Alg. 1. MMCS uses a discrete controller to represent the policy, $\phi^{(i)}$, of each agent. The nodes of the discrete controller are TMAs, $\pi \in \mathbb{T}$, and edges are observations, $\breve{o} \in \mathbb{O}$. Each agent $i$ transitions in the discrete controller by completing a TMA $\pi_k$, seeing an observation $\breve{o}_k$, and following the appropriate edge to the next TMA, $\pi_{k+1}$. Fig. shows part of a single robot's policy.

MMCS initially generates a set of valid policies by ran¬ 
domly sampling the policy space (Line [T 3 ] l, while adhering 
to the constraint that the termination set of a TMA node in 
the discrete controller must intersect the initiation set of its 
child TMA node. The joint value, V'^(b, x®), of each policy 
given an initial joint belief, b, and e-state, x®, is calculated 
by repeatedly applying the policy (either in a simulator or a 
real-world system) and taking its expected value (Line 15 1 . 


MMCS identifies the TMA transitions that occur most 
often in the set of policies with the K highest joint values, 
in a process called ‘masking’ (Line [TS] ). It then opts to 
include these transitions in future iterations of the policy 
search by explicitly checking for the existence of a mask 
for that transition, preventing the transition from being re¬ 
sampled (Line [T 2 ] i. Note that the mask is not permanent, as 
re-evaluation of the mask using the best K policies occurs 
in each iteration of MMCS (Line [r8] l. 


The above process is repeated until a computational budget 
is reached, after which the best policy is selected as the 
approximate solution to the Dec-POSMDP (Line [20| . Our 
algorithm offers the advantage of balancing exploration and 
exploitation of the search space. Although ‘masking’ places 
focus on promising policies, it is done in conjunction with 
random sampling, allowing previously unexplored regions of 
the policy space to be sampled. 
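The masking loop described above can be illustrated with a deliberately simplified sketch. Several assumptions are made here and none of them comes from Alg. 1: a policy is flattened to a tuple with one choice per "slot" (a stand-in for one controller transition), a slot is masked when a single choice appears in more than a fraction `agree` of the top-`top_k` policies, the initiation/termination validity constraint is omitted, and `evaluate` is a deterministic toy function rather than a simulated expected value. All names are hypothetical.

```python
import random
from collections import Counter


def mmcs_sketch(n_slots, n_choices, evaluate, iterations,
                batch=20, top_k=5, agree=0.6, seed=0):
    """Toy masked Monte Carlo search over fixed-length policies."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    rng = random.Random(seed)
    mask = {}       # slot -> choice frozen for the next round of sampling
    evaluated = []  # (value, policy) pairs seen so far
    for _ in range(iterations):
        for _ in range(batch):
            # Masked slots are copied; every other slot is re-sampled.
            policy = tuple(
                mask[s] if s in mask else rng.randrange(n_choices)
                for s in range(n_slots)
            )
            evaluated.append((evaluate(policy), policy))
        # Rank all policies seen so far and keep the top_k best.
        evaluated.sort(key=lambda pair: pair[0], reverse=True)
        top = [policy for _, policy in evaluated[:top_k]]
        # Rebuild the mask from scratch, so it is never permanent.
        mask = {}
        for s in range(n_slots):
            choice, count = Counter(p[s] for p in top).most_common(1)[0]
            if count / len(top) > agree:
                mask[s] = choice
    best_value, best_policy = evaluated[0]
    return best_policy, best_value, mask


# Hypothetical toy objective: count the slots that hold choice 0.
def toy_value(policy):
    return sum(1 for choice in policy if choice == 0)


best, value, mask = mmcs_sketch(6, 4, toy_value, iterations=10)
```

In this sketch, exploitation comes from copying masked slots and exploration comes from re-sampling the unmasked ones; because the mask is recomputed every iteration, a slot can become unmasked again if the top policies stop agreeing on it.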















V. Experiments 


In this section, we consider a package delivery under 
uncertainty scenario involving a team of heterogeneous 
robots, an application which has recently received particular 
attention [2], [7]. The overall objective in this problem is to 
retrieve and deliver packages from base locations to delivery 
locations using a group of robots. 

Though our method is general and can be used for many
decentralized planning problems, this domain was chosen
due to its extreme complexity. For instance, the policy space
of the posed problem has cardinality $5.622 \times 10^{17}$, making
exhaustive search computationally intractable. Additional challenges
stem from the presence of different sources of uncertainty
(wind, actuator, sensor), obstacles and constraints in the
environment, and different types of tasks (pick-up, drop-off,
etc.). Also, the presence of joint tasks such as joint
pickup of large packages introduces a significant multi-robot
coordination component. This problem is formulated as a
Dec-POMDP in continuous space, and hence current
Dec-POMDP methods are not applicable. Even if we could modify
the domain to allow them to be used, simply discretizing
the state/action/observation space in this problem would lead
to poor solution quality or a computationally intractable
Dec-POMDP.

Fig. E illustrates the package delivery domain. Robots 
are classified into two categories: air vehicles (quadcopters) 
and ground vehicles (trucks). Air vehicles handle pickup 
of packages from bases, and can also deliver packages to 
two delivery locations, $Dest_1$ and $Dest_2$. An additional
delivery location, $Dest_r$, exists in a regulated airspace, where
air vehicles cannot fly. Instead, packages destined for the 
regulated airspace zone must be handed off to a ground 
vehicle at a rendezvous location. The ground vehicle is solely 
responsible for deliveries in this regulated region. Rewards 
are given to the team only when a package is dropped off at 
its correct delivery destination. 

Packages are available for pickup at two bases in the
domain. Each base contains a maximum of one package,
and a stochastic generative model is used for allocating
packages to bases. Each package has a designated delivery
location, $\delta \in \Delta = \{d_1, d_2, d_r\}$. Additionally, each base has
a size descriptor for its package, $\psi \in \Psi = \{0, 1, 2\}$, where
$\psi = 0$ indicates no package at the base, $\psi = 1$ indicates
a small package, and $\psi = 2$ indicates a large package.
Small packages can be picked up by a single air vehicle,
whereas large packages require cooperative pickup by two air
vehicles. The descriptors of package destinations and sizes
will implicitly impact the policy of the decentralized planner.

To allow coordination of cooperative TMAs, an environment
variable stating the availability of a nearby vehicle,
$\phi \in \Phi = \{0, 1\}$, is observable by robots at any base or at
the rendezvous location, where $\phi = 1$ signifies that another
robot is at the same milestone (or location) as the current
robot and $\phi = 0$ signifies that all other robots are outside of
some radius of the current robot’s location.

Fig. 3. Partial segment of a single robot’s policy obtained using MMCS
for the package delivery domain. In this discrete policy controller, nodes
represent TMAs and edges represent e-states. Greyed out edges represent
connections to additional nodes which have been excluded to simplify the
figure.

A robot at any base location can, therefore, observe the
environmental state $x^e = (\psi, \delta, \phi) \in \Psi \times \Delta \times \Phi$, which
contains details about the availability and the size of the
package at the base (if it exists), the delivery destination of
the package, and availability of nearby robots (for performing
cooperative tasks). A robot at the rendezvous location can
observe $x^e = \phi \in \Phi$ for guidance of rendezvous TMAs.
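As a concrete illustration of the e-state observed at a base, the sketch below encodes $(\psi, \delta, \phi)$ as a small immutable record. The class name `BaseObservation`, its methods, and the string labels for destinations are hypothetical; only the three value sets and the pickup rules (one air vehicle for a small package, two for a large one) come from the text.

```python
from dataclasses import dataclass
from itertools import product

PACKAGE_SIZES = (0, 1, 2)          # psi: 0 = none, 1 = small, 2 = large
DESTINATIONS = ("d1", "d2", "dr")  # delta: designated delivery location
NEARBY = (0, 1)                    # phi: 1 if another robot is co-located


@dataclass(frozen=True)
class BaseObservation:
    psi: int
    delta: str
    phi: int

    def __post_init__(self):
        # Reject values outside the three sets defined in the text.
        if self.psi not in PACKAGE_SIZES:
            raise ValueError(f"invalid package size: {self.psi}")
        if self.delta not in DESTINATIONS:
            raise ValueError(f"invalid destination: {self.delta}")
        if self.phi not in NEARBY:
            raise ValueError(f"invalid availability flag: {self.phi}")

    def single_pickup_possible(self) -> bool:
        # A small package can be picked up by one air vehicle.
        return self.psi == 1

    def joint_pickup_possible(self) -> bool:
        # A large package needs a second robot at the same location.
        return self.psi == 2 and self.phi == 1


# Enumerate the full product space of base observations.
ALL_BASE_OBSERVATIONS = [
    BaseObservation(psi, delta, phi)
    for psi, delta, phi in product(PACKAGE_SIZES, DESTINATIONS, NEARBY)
]
```

The product space is small (3 x 3 x 2 = 18 observations per base), which is what makes it practical to use e-states as the edge labels of a discrete policy controller.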

In this mission, we assume one ground robot and two air
robots are used. We consider two base locations, $Base_1$ and
$Base_2$. Air robots are initially located at $Base_1$ and the
ground robot is at $Dest_r$. If $b^i \in Base_j$, we say the $i$-th
robot is at the $j$-th base.

The available TMAs in this domain are: 

• Go to base $Base_j$ for $j \in \{1, 2\}$

- Robots involved: 1 air robot.

- Initiation set: Air robot $i$ is available, where
$i \in \{i_{a,1}, i_{a,2}\}$.

- Termination set: Robot $i$ is available and its belief $b^i$ is
in the belief set associated with $Base_j$.

• Go to delivery destination $Dest_j$ for $j \in \{1, 2, r\}$

- Robots involved: 1 (any type).

- Initiation set: Robot $i$ is available and can be at any
location.

- Termination set: Robot $i$ is available and its belief $b^i$ is
in the belief set associated with $Dest_j$.

• Joint go to delivery destination $Dest_j$ for $j \in \{1, 2\}$

- Robots involved: 2 air robots.

- Initiation set: Air robots $i_{a,1}$ and $i_{a,2}$ are available.

- Termination set: Robots $i_{a,1}$ and $i_{a,2}$ are available
and their beliefs $b^{i_{a,1}}$ and $b^{i_{a,2}}$ are in the belief set
associated with $Dest_j$, where $j \in \{1, 2\}$.

• Pick up package

- Robots involved: 1 air robot.

- Initiation set: Air robot $i$ is available, where
$i \in \{i_{a,1}, i_{a,2}\}$, and $x^e = \{\psi = 1, \delta \in \Delta, \phi \in \Phi\}$.

- Termination set: Air robot $i$ is carrying the package
and is unavailable.

• Joint pick up package

- Robots involved: 2 air robots.

- Initiation set: Air robots $i_{a,1}$ and $i_{a,2}$ are available,
and $x^e = \{\psi = 2, \delta \in \Delta, \phi = 1\}$.

- Termination set: Robots $i_{a,1}$ and $i_{a,2}$ are carrying the
package and are unavailable.

• Put down package

- Robots involved: 1 ground robot or 1 air robot.

- Initiation set: Robot $i \in I$ is carrying a package and its
belief $b^i$ is in the belief set associated with $Dest_j$,
where $j \in \{1, 2\}$.

- Termination set: Robot $i$ is available.

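• Joint put down package

- Robots involved: 2 air robots.

- Initiation set: Air robots $i_{a,1}$ and $i_{a,2}$ are available
and jointly carrying a package, with beliefs $b^{i_{a,1}}$ and
$b^{i_{a,2}}$ in the belief set associated with $Dest_j$, where
$j \in \{1, 2\}$.

- Termination set: Robots $i_{a,1}$ and $i_{a,2}$ are available
and their beliefs $b^{i_{a,1}}$ and $b^{i_{a,2}}$ are in the belief set
associated with $Dest_j$, where $j \in \{1, 2\}$.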

• Go to rendezvous location

- Robots involved: 1 ground robot or 1 air robot.

- Initiation set: Robot $i \in I$ is available.

- Termination set: Robot $i$ is available and its belief
$b^i$ is in the belief set associated with the rendezvous location.

• Place package on truck

- Robots involved: 1 air robot, 1 ground robot.

- Initiation set: Air robot $i_a \in \{i_{a,1}, i_{a,2}\}$ and
ground robot $i_g$ are available and located at the rendezvous
location, where their beliefs $b^{i_a}$ and $b^{i_g}$ are in the belief
set associated with the rendezvous location.

- Termination set: Robots $i_a$ and $i_g$ are available and
their beliefs $b^{i_a}$ and $b^{i_g}$ are in their respective belief sets,
where $j \in \{1, 2\}$.

• Wait at current location

- Robots involved: 1 ground robot or 1 air robot.

- Initiation set: Robot $i \in I$ is available.

- Termination set: Robot $i$ is available.


Each TMA is defined with an associated initiation and
termination set. For instance, the “Go to $Base_j$” TMA is defined as:

• Robots involved: 1 air robot.

• Initiation set: An air robot $i$ is available.

• Termination set: Robot $i$ is available and $b^i \in Base_j$.

The robot’s belief and the environmental state affect the
TMA’s behavior when undertaken. For instance, the behavior
of the “Go to $Base_j$” TMA will be affected by the presence
of obstacles for different robots. For details on the definition
of other TMAs, see [16].
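To make the role of initiation and termination sets concrete, the sketch below represents each TMA by two sets and checks the controller constraint from Section IV, namely that a parent's termination set must intersect its child's initiation set. This is a coarse illustration under stated assumptions: the true sets live in belief space, whereas here they are hypothetical symbolic labels such as `"available@base1"`, and the `TMA` class and `can_follow` helper are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TMA:
    name: str
    initiation: frozenset   # symbolic states in which the TMA may start
    termination: frozenset  # symbolic states in which the TMA may end


def can_follow(parent: TMA, child: TMA) -> bool:
    """True if the child can be started from some state the parent ends in."""
    return bool(parent.termination & child.initiation)


# Hypothetical symbolic abstraction of the robot's status and location.
LOCATIONS = ("base1", "base2", "dest1", "dest2", "rendezvous")
AVAILABLE = frozenset(f"available@{loc}" for loc in LOCATIONS)

go_to_base1 = TMA("Go to Base1", AVAILABLE, frozenset({"available@base1"}))
pick_up = TMA(
    "Pick up package",
    frozenset({"available@base1", "available@base2"}),
    frozenset({"carrying@base1", "carrying@base2"}),
)
put_down = TMA(
    "Put down package",
    frozenset({"carrying@dest1", "carrying@dest2"}),
    frozenset({"available@dest1", "available@dest2"}),
)

valid_edge = can_follow(go_to_base1, pick_up)     # robot arrives, then picks up
invalid_edge = can_follow(go_to_base1, put_down)  # nothing to put down at a base
```

A policy sampler can call a check of this kind before adding an edge to the controller, so that only valid policies are ever evaluated.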

To generate a closed-loop policy corresponding to each
TMA, we follow the procedure explained in Section III. For
example, for the “Go to $Base_j$” TMA, the state of the system
consists of the air robot’s pose (position and orientation).
Thus, the belief space for this TMA is the space of all
possible probability distributions over the system’s pose. To
follow the procedure in Section III, we incrementally sample
beliefs in the belief space, design corresponding FMAs,
generate a graph of FMAs, and solve a dynamic program
on this graph to generate the TMA policy. Fig. 4 shows
the performance of an example TMA (“Go to $Dest_1$”). The
results show that the macro-action is optimized to achieve the
goal in the manner that produces the highest value, while
remaining robust to noise and tending to minimize the
constraint violation probability. The most important observation
is that the value function is available over the entire belief space.
The same is true for the success probability and completion
time of the TMA. This information can then be directly used
by the MMCS algorithm to evaluate policies
that use this macro-action.

A portion of an example policy for a single air robot
in the package delivery domain is illustrated in Fig. 3.
The policy involves going to $Base_1$ and observing $x^e$.
Subsequently, the robot chooses to either pick up the package
(alone or with another robot) or go to $Base_2$. The full
policy controller includes more nodes than shown in Fig. 3
(for all possible TMAs) as well as more edges (for all possible
environment observations).

Fig. 0 compares a uniform random Monte Carlo search 
to MMCS in the package delivery domain. Results for the 
Monte Carlo search were generated by repeatedly sampling 
random, valid policies and retaining the policy with the highest
expected value. Both approaches used a policy controller
with a fixed number of nodes ($n_{nodes} = 13$).

Results indicate that, given a fixed computational budget
(quantified by the number of search iterations), MMCS
outperforms standard Monte Carlo search in terms of expected
value. Specifically, as seen in Table I, after 1000 search
iterations in the package delivery problem, the expected value
of the policy from MMCS is 118% higher than the one
obtained by Monte Carlo search. Additionally, due to this
problem’s extremely large policy space, determination of an
optimal joint value through exhaustive search is not possible.

To more intuitively quantify the performance of MMCS and
Monte Carlo policies in the package delivery domain, Fig. 7
compares the success probability of delivering a given minimum
number of packages for both methods, within a fixed mission
time horizon. Results were generated over 250 simulated runs
with randomly generated package types and initial conditions
for robots. Using the Monte Carlo policy, the probability of
successfully delivering more than 2 packages given a fixed
time horizon is almost negligible, whereas the policy resulting
from MMCS successfully delivered a larger number of
packages (in some cases up to 9).

Fig. 4. This figure shows the value of the “Go to $Dest_1$” TMA over its belief
space (only the 2D mean part is shown).

Fig. 5. Comparison of best policy value over 1000 iterations for Monte
Carlo and MMCS.

Fig. 6. Comparison of policy sampling differences between Monte Carlo
and MMCS. MMCS allows exploration of the policy space while exploiting
knowledge of promising policies.

Fig. 7. Success probability of delivering different numbers of packages
with a fixed time horizon. Note that this is probability of delivering the
specified number of packages or more.
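The quantity plotted in Fig. 7 is an empirical "at least k" probability over simulated runs. The sketch below shows one way such a curve could be computed from per-run delivery counts; the function name `success_probability` and the sample counts are hypothetical and are not the paper's data.

```python
def success_probability(delivered_counts, max_k):
    """Fraction of runs that delivered at least k packages, for k = 1..max_k."""
    n_runs = len(delivered_counts)
    if n_runs == 0:
        raise ValueError("at least one simulated run is required")
    return {
        k: sum(1 for count in delivered_counts if count >= k) / n_runs
        for k in range(1, max_k + 1)
    }


# Hypothetical delivery counts from five simulated runs.
curve = success_probability([0, 1, 2, 2, 5], max_k=3)
```

Because each entry counts runs that delivered k packages or more, the resulting curve is non-increasing in k, which matches the note in the Fig. 7 caption.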

The experiments indicate that the Dec-POSMDP framework
and our MMCS algorithm allow high-quality solutions
to be generated for complex problems, such as multi-robot
package delivery, that would not be possible using traditional
Dec-POMDP approaches.

VI. Conclusion 

This paper proposed a layered framework to exploit 
the macro-action abstraction in decentralized planning for 
a group of robots acting under uncertainty. It formally 
reduces the Dec-POMDP problem to a Dec-POSMDP that 
can be solved using discrete space search techniques. We 
have additionally formulated an algorithm, MMCS, for 
solving Dec-POSMDPs and have shown its performance on 
a multi-robot package delivery problem under uncertainty. 
The Dec-POSMDP represents a general formalism for 
probabilistic multi-robot coordination problems and our 
results show that high-quality solutions can be automatically 
generated from this high-level domain specification. 